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^ Abstract 
H 

The bootstrap provides a simple and powerful means of assessing the quality of esti- 
mators. However, in settings involving large datasets — which are increasingly prevalent— 
the computation of bootstrap-based quantities can be prohibitively demanding com- 
putationally. While variants such as subsampling and the m out of n bootstrap can be 
used in principle to reduce the cost of bootstrap computations, we find that these meth- 
ods are generally not robust to specification of hyperparameters (such as the number 
of subsampled data points) , and they often require use of more prior information (such 
as rates of convergence of estimators) than the bootstrap. As an alternative, we intro- 
duce the Bag of Little Bootstraps (BLB), a new procedure which incorporates features 
of both the bootstrap and subsampling to yield a robust, computationally efficient 
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means of assessing the quality of estimators. BLB is well suited to modern parallel and 
distributed computing architectures and furthermore retains the generic applicability 
and statistical efficiency of the bootstrap. We demonstrate BLB's favorable statistical 
performance via a theoretical analysis elucidating the procedure's properties, as well 
as a simulation study comparing BLB to the bootstrap, the m out of n bootstrap, and 
subsampling. In addition, we present results from a large-scale distributed implemen- 
tation of BLB demonstrating its computational superiority on massive data, a method 
for adaptively selecting BLB's hyperparameters, an empirical study applying BLB to 
several real datasets, and an extension of BLB to time series data. 



1 Introduction 



The development of the bootstrap and related resampling-based methods in the 1960s and 
1970s heralded an era in statistics in which inference and computation became increasingly 



intertwined (Efron, 1979 Diaconis and Efron, 1983). By exploiting the basic capabilities of 



the classical von Neumann computer to simulate and iterate, the bootstrap made it possible 
to use computers not only to compute estimates but also to assess the quality of estimators, 



yielding results that are quite generally consistent (Bickel and Freedman, 1981 Gine and 



Zinn, 1990 van der Vaart and Wellner, 1996) and often more accurate than those based 



upon asymptotic approximation (Hall, 1992). Moreover, the bootstrap aligned statistics to 



computing technology, such that advances in speed and storage capacity of computers could 
immediately allow statistical methods to scale to larger datasets. 

Two recent trends are worthy of attention in this regard. First, the growth in size 
of datasets is accelerating, with "massive" datasets becoming increasingly prevalent. Sec- 
ond, computational resources are shifting toward parallel and distributed architectures, with 
multicore and cloud computing platforms providing access to hundreds or thousands of pro- 
cessors. The second trend is seen as a mitigating factor with respect to the first, in that 
parallel and distributed architectures present new capabilities for storage and manipulation 
of data. However, from an inferential point of view, it is not yet clear how statistical method- 
ology will transport to a world involving massive data on parallel and distributed computing 
platforms. 

While massive data bring many statistical issues to the fore, including issues in ex- 
ploratory data analysis and data visualization, there remains the core inferential need to 
assess the quality of estimators. Indeed, the uncertainty and biases in estimates based on 
large data can remain quite significant, as large datasets are often high dimensional, are 
frequently used to fit complex models with large numbers of parameters, and can have many 
potential sources of bias. Furthermore, even if sufficient data are available to allow highly 
accurate estimation, the ability to efficiently assess estimator quality remains essential to 
allow efficient use of available resources by processing only as much data as is necessary to 
achieve a desired accuracy or confidence. 

The bootstrap brings to bear various desirable features in the massive data setting, 
notably its relatively automatic nature and its applicability to a wide variety of inferential 
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problems. It can be used to assess bias, to quantify the uncertainty in an estimate (e.g., 
via a standard error or a confidence interval), or to assess risk. However, these virtues are 
realized at the expense of a substantial computational burden. Bootstrap-based quantities 
typically must be computed via a form of Monte Carlo approximation in which the estimator 
in question is repeatedly applied to resamples of the entire original observed dataset. 

Because these resamples have size on the order of that of the original data, with approx- 
imately 63% of data points appearing at least once in each resample, the usefulness of the 
bootstrap is severely blunted by the large datasets increasingly encountered in practice. In 
the massive data setting, computation of even a single point estimate on the full dataset 
can be quite computationally demanding, and so repeated computation of an estimator on 
comparably sized resamples can be prohibitively costly. To mitigate this problem, one might 
naturally attempt to exploit the modern trend toward parallel and distributed computing. 
Indeed, at first glance, the bootstrap would seem ideally suited to straightforwardly lever- 
aging parallel and distributed computing architectures: one might imagine using different 
processors or compute nodes to process different bootstrap resamples independently in par- 
allel. However, the large size of bootstrap resamples in the massive data setting renders this 
approach problematic, as the cost of transferring data to independent processors or compute 
nodes can be overly high, as is the cost of operating on even a single resample using an 
independent set of computing resources. 

While the literature does contain some discussion of techniques for improving the com- 
putational efficiency of the bootstrap, that work is largely devoted to reducing the number 
of resamples required (Efron, 1988 Efron and Tibshirani, 1993). These techniques in general 
introduce significant additional complexity of implementation and do not eliminate the crip- 
pling need for repeated computation of the estimator on resamples having size comparable 
to that of the original dataset. 



Another landmark in the development of simulation-based inference is subsampling (Poli- 
tis et al.j |1999 ) and the closely related m out of n bootstrap (Bickel et al. 1997). These 



methods (which were introduced to achieve statistical consistency in edge cases in which the 
bootstrap fails) initially appear to remedy the bootstrap's key computational shortcoming, 
as they only require repeated computation of the estimator under consideration on resam- 
ples (or subsamples) that can be significantly smaller than the original dataset. However, 
these procedures also have drawbacks. As we show in our simulation study, their success is 
sensitive to the choice of resample (or subsample) size (i.e., m in the m out of n bootstrap). 
Additionally, because the variability of an estimator on a subsample differs from its vari- 
ability on the full dataset, these procedures must perform a rescaling of their output, and 
this rescaling requires knowledge and explicit use of the convergence rate of the estimator 
in question; these methods are thus less automatic and easily deployable than the boot- 
strap. While schemes have been proposed for data-driven selection of an optimal resample 
size (Bickel and Sakov, 2008), they require significantly greater computation which would 



eliminate any computational gains. Also, there has been work on the m out of n bootstrap 
that has sought to reduce computational costs using two different values of m in conjunc- 
tion with extrapolation (Bickel and Yahav, 1988; Bickel and Sakov, 2002). However, these 
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approaches explicitly utilize series expansions of the estimator's sampling distribution and 
hence are less automatically usable; they also require execution of the m out of n bootstrap 
for multiple values of m. 

Motivated by the need for an automatic, accurate means of assessing estimator quality 
that is scalable to large datasets, we introduce a new procedure, the Bag of Little Bootstraps 
(BLB), which functions by combining the results of bootstrapping multiple small subsets of 
a larger original dataset. Instead of applying an estimator directly to each small subset, as 
in the m out of n bootstrap and subsampling, the Bag of Little Bootstraps (BLB) applies 
the bootstrap to each small subset, where in the resampling process of each individual boot- 
strap run, weighted samples are formed such that the effect is that of sampling the small 
subset n times with replacement, but the computational cost is that associated with the 
size of the small subset. This has the effect that, despite operating only on subsets of the 
original dataset, BLB does not require analytical rescaling of its output. Overall, BLB has 
a significantly more favorable computational profile than the bootstrap, as it only requires 
repeated computation of the estimator under consideration on quantities of data that can be 
much smaller than the original dataset. As a result, BLB is well suited to implementation 
on modern distributed and parallel computing architectures which are often used to process 
large datasets. Also, our procedure maintains the bootstrap's generic applicability, favor- 
able statistical properties (i.e., consistency and higher-order correctness), and simplicity of 
implementation. Finally, as we show in experiments, BLB is consistently more robust than 
alternatives such as the m out of n bootstrap and subsampling. 

The remainder of our presentation is organized as follows. In Section |2j we formalize our 
statistical setting and notation, present BLB in detail, and discuss the procedure's compu- 
tational characteristics. Subsequently, in Section [3j we elucidate BLB's statistical properties 
via a theoretical analysis (Section 3.1) showing that BLB shares the bootstrap's consistency 
and higher-order correctness, as well as a simulation study (Section 3.2) which compares 
BLB to the bootstrap, the m out of n bootstrap, and subsampling. Section [4] discusses a 
large-scale implementation of BLB on a distributed computing system and presents results 
illustrating the procedure's superior computational performance in the massive data setting. 
We present a method for adaptively selecting BLB's hyperparameters in Section [5j Finally, 
we apply BLB (as well as the bootstrap and the m out of n bootstrap, for comparison) to 
several real datasets in Section [6j we present an extension of BLB to time series data in 
Section [7], and we conclude in Section |8j 



2 Bag of Little Bootstraps (BLB) 
2.1 Setting and Notation 

We assume that we observe a sample Xi, . . . ,X n G X drawn i.i.d. from some (unknown) 
underlying distribution P G V; we denote by P n = rT x J2i=i ^ the corresponding empirical 
distribution. Based only on this observed data, we compute an estimate 9 n G 6 of some 
(unknown) population value 8 G associated with P. For example, 9 n might estimate a 



4 



measure of correlation, the parameters of a regressor, or the prediction accuracy of a trained 
classification model. When we wish to explicitly indicate the data used to compute an 
estimate, we shall write 0(P n ). Noting that 9 n is a random quantity because it is based on n 
random observations, we define Q n (P) G Q as the true underlying distribution of 9 n , which 
is determined by both P and the form of the estimator. Our end goal is the computation of 
an estimator quality assessment £(Q n (P), P) : Q x V — > 5, for 5 a vector space; to lighten 
notation, we shall interchangeably write £,{Q n (P)) in place of (,(Q n (P), P). For instance, £ 
might compute a quantile, a confidence region, a standard error, or a bias. In practice, we 
do not have direct knowledge of P or Q n (P), and so we must estimate £(Q«(-P)) itself based 
only on the observed data and knowledge of the form of the estimator under consideration. 

Note that we allow £ to depend directly on P in addition to Q n (P) because £ might 
operate on the distribution of a centered and normalized version of 9 n . For example, if £ 
computes a confidence region, it might manipulate the distribution of the statistic \/n(9 n — 9), 
which is determined by both Q n (P) and 9; because 9 cannot in general be obtained directly 
from Q n (P), a direct dependence on P is required in this case. Nonetheless, given knowledge 
of Q n (P), any direct dependence of £ on P generally has a simple form, often only involving 
the parameter 9. Additionally, rather than restricting Q n (P) to be the distribution of 9 n , we 
could instead allow it to be the distribution of a more general statistic, such as (9 n , a n ), where 
a n is an estimate of the standard deviation of 9 n (e.g., this would apply when constructing 
confidence intervals based on the distribution of the studentized statistic (9 n — 9)/a n ). Our 
subsequent development generalizes straightforwardly to this setting, but to simplify the 
exposition, we will largely assume that Q n (P) is the distribution of 9 n . 

Under our notation, the bootstrap simply computes the data-driven plugin approximation 
£(Qn{P)) ~ £(QnQPn))- Although £(<3nQPn)) cannot be computed exactly in most cases, 
it is generally amenable to straightforward Monte Carlo approximation via the following 



algorithm (Efron and Tibshirani, 1993): repeatedly resample n points i.i.d. from P n , compute 
the estimate on each resample, form the empirical distribution Q* of the computed estimates, 
and approximate £(Q n (P)) « £(Q*). 

Similarly, using our notation, the m out of n bootstrap (and subsampling) functions as 



follows, for m < n (Bickel et al. 1997 Politis et al. 1999): repeatedly resample m points i.i.d. 



from P n (subsample m points without replacement from Xi, . . . , X n ), compute the estimate 
on each resample (subsample), form the empirical distribution of the computed estimates, 
approximate £(Q m {P)) ~ ^(Q„), and apply an analytical correction to in turn approximate 
£,{Qn{P))- This final analytical correction uses prior knowledge of the convergence rate of 
9 n as n increases and is necessary because each value of the estimate is computed based on 
only m rather than n points. 

We use Id to denote the d- dimensional vector of ones, and we let Id denote the d x d 
identity matrix. 

2.2 Bag of Little Bootstraps 

The Bag of Little Bootstraps (BLB) functions by averaging the results of bootstrapping 
multiple small subsets of X%, . . . , X n . More formally, given a subset size b < n, BLB samples 
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s subsets of size b from the original n data points, uniformly at random (one can also 
impose the constraint that the subsets be disjoint). Let X\,...,I S C {l,...,n} be the 
corresponding index multisets (note that \Xj\ = b, Vj), and let = b^ 1 ^2 ieX . <5x; be the 
empirical distribution corresponding to subset j. BLB's estimate of £(Q n (P)) is then given 
by 

s 



Although the terms 6(Qn.(P^],)) * n (^J) carm °t be computed analytically in general, they can 
be computed numerically via straightforward Monte Carlo approximation in the manner of 
the bootstrap: for each term j, repeatedly resample n points i.i.d. from P^|, compute the 
estimate on each resample, form the empirical distribution Q* ■ of the computed estimates, 

and approximate £(Q„(P$)) « t(Q* n ,j)- 

Now, to realize the substantial computational benefits afforded by BLB, we utilize the 
following crucial fact: each BLB resample, despite having nominal size n, contains at most 
b distinct data points. In particular, to generate each resample, it suffices to draw a vector 
of counts from an n-trial uniform multinomial distribution over b objects. We can then 
represent each resample by simply maintaining the at most b distinct points present within 
it, accompanied by corresponding sampled counts (i.e., each resample requires only storage 
space in 0(b)). In turn, if the estimator can work directly with this weighted data repre- 
sentation, then the computational requirements of the estimator — with respect to both time 
and storage space — scale only in b, rather than n. Fortunately, this property does indeed 
hold for many if not most commonly used estimators, such as general M-estimators. The 
resulting BLB algorithm, including Monte Carlo resampling, is shown in Algorithm [TJ 

Thus, BLB only requires repeated computation on small subsets of the original dataset 
and avoids the bootstrap's problematic need for repeated computation of the estimate on 
resamples having size comparable to that of the original dataset. A simple and standard 



calculation (Efron and Tibshirani 1993) shows that each bootstrap resample contains ap- 
proximately 0.632n distinct points, which is large if n is large. In contrast, as discussed 
above, each BLB resample contains at most b distinct points, and b can be chosen to be 
much smaller than n or 0.632n. For example, we might take b = n 1 where 7 £ [0.5, 1]. More 
concretely, if n = 1,000,000, then each bootstrap resample would contain approximately 
632, 000 distinct points, whereas with b = n os each BLB subsample and resample would 
contain at most 3,981 distinct points. If each data point occupies 1 MB of storage space, 
then the original dataset would occupy 1 TB, a bootstrap resample would occupy approx- 
imately 632 GB, and each BLB subsample or resample would occupy at most 4 GB. As a 
result, the cost of computing the estimate on each BLB resample is generally substantially 
lower than the cost of computing the estimate on each bootstrap resample, or on the full 
dataset. Furthermore, as we show in our simulation study and scalability experiments below, 
BLB typically requires less total computation (across multiple data subsets and resamples) 
than the bootstrap to reach comparably high accuracy; fairly modest values of s and r suffice. 
Due to its much smaller subsample and resample sizes, BLB is also significantly more 
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Algorithm 1: Bag of Little Bootstraps (BLB) 



Input: Data Xi, . . . , X n b: subset size 

9: estimator of interest s: number of sampled subsets 

£: estimator quality assessment r: number of Monte Carlo iterations 

Output: An estimate of £(Q n (P)) 

for j <— 1 to s do 

// Subsample the data 

Randomly sample a set X = . . . , %} of b indices from {1, . . . , n} without 
replacement 

[or, choose X to be a disjoint subset of size b from a predefined random partition of 
{l,...,n}] 

// Approximate £(Q„(P$)) 
for k <— 1 to r do 

Sample (ni, . . . , n&) ~ Multinomial(n, lb/b) 

K,k^n-*j: b a=1 n a 5 Xia 

end 
end 

// Average values of £(<3n0P^|)) computed for different data subsets 
return s" 1 EJ=i Cj 



amenable than the bootstrap to distribution of different subsamples and resamples and their 
associated computations to independent compute nodes; therefore, BLB allows for simple 
distributed and parallel implementations, enabling additional large computational gains. In 
the large data setting, computing a single full-data point estimate often requires simultaneous 
distributed computation across multiple compute nodes, among which the observed dataset 
is partitioned. Given the large size of each bootstrap resample, computing the estimate on 
even a single such resample in turn also requires the use of a comparably large cluster of 
compute nodes; the bootstrap requires repetition of this computation for multiple resamples. 
Each computation of the estimate is thus quite costly, and the aggregate computational 
costs of this repeated distributed computation are quite high (indeed, the computation for 
each bootstrap resample requires use of an entire cluster of compute nodes and incurs the 
associated overhead). 

In contrast, BLB straightforwardly permits computation on multiple (or even all) sub- 
samples and resamples simultaneously in parallel: because BLB subsamples and resamples 
can be significantly smaller than the original dataset, they can be transferred to, stored by, 
and processed on individual (or very small sets of) compute nodes. For example, we could 
naturally leverage modern hierarchical distributed architectures by distributing subsamples 



7 



to different compute nodes and subsequently using intra-node parallelism to compute across 
different resamples generated from the same subsample. Thus, relative to the bootstrap, 
BLB both decreases the total computational cost of assessing estimator quality and allows 
more natural use of parallel and distributed computational resources. Moreover, even if only 
a single compute node is available, BLB allows the following somewhat counterintuitive pos- 
sibility: even if it is prohibitive to actually compute a point estimate for the full observed 
data using a single compute node (because the full dataset is large), it may still be possible 
to efficiently assess such a point estimate's quality using only a single compute node by 
processing one subsample (and the associated resamples) at a time. 

Returning to equation ([T|, unlike the plugin approximation ^(Qnffin)) used by the boot- 
strap, the plugin approximations £(Qn(J?n\)) used by BLB are based on empirical distri- 
butions which are more compact and hence, as we have seen, less computationally 

demanding than the full empirical distribution P n . However, each is inferior to P n as an 
approximation to the true underlying distribution P, and so BLB averages across multiple 
different realizations of P^ to improve the quality of the final result. This procedure yields 
significant computational benefits over the bootstrap (as discussed above and demonstrated 
empirically in Section [4]), while having the same generic applicability and favorable statis- 
tical properties as the bootstrap (as shown in the next section), in addition to being more 
robust than the m out of n bootstrap and subsampling to the choice of subset size (see our 
simulation study below). 



3 Statistical Performance 

3.1 Consistency and Higher- Order Correctness 

We now show that BLB has statistical properties — in particular, asymptotic consistency 
and higher-order correctness — which are identical to those of the bootstrap, under the same 
conditions that have been used in prior analysis of the bootstrap. Note that if 9 n is consistent 
(i.e., approaches 9 in probability) as n — > oo, then it has a degenerate limiting distribution. 
Thus, in studying the asymptotics of the bootstrap and related procedures, it is typical 
to assume that £ manipulates the distribution of a centered and normalized version of 9 n 
(though this distribution is still determined by Q n {P) and P). Additionally, as in standard 
analyses of the bootstrap, we do not explicitly account here for error introduced by use of 
Monte Carlo approximation to compute the individual plugin approximations ^,(Qn(^n\))- 

The following theorem states that (under standard assumptions) as b, n — > oo, the es- 
timates s -1 Yuj=i £(Qn0Pn]>)) turned by BLB approach the population value £{Qn(P)) in 
probability. Interestingly, the only assumption about b required for this result is that b — > oo, 
though in practice we would generally take b to be a slowly growing function of n. 

Theorem 1. Suppose that 9 n = 0(P n ) and 9 = <f){P), where is Hadamard differentiable at 
P tangentially to some subspace, with P, F n , and P^| viewed as maps from some Donsker 
class J 7 to E such that is measurable for every 5 > 0, where J-~s = {/ — g '■ f, g € 
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1 /2 

J~,pp(f — g) < 5} and pp(f) = (P(f — Pf) 2 ) ■ Additionally, assume that £(Q n (P)) is a 
function of the distribution of y/n((j)(¥ n ) — (j)(P)) which is continuous in the space of such 
distributions with respect to a metric that metrizes weak convergence. Then, 

s 

S- iy Et(Qn<tf!l))-Z{Qn(P))$0 

asn-} oo ; for any sequence b — > oo and for any fixed s. 

See the appendix for a proof of this theorem, as well as for proofs of all other results 
in this section. Note that the assumptions of Theorem [T] are standard in analysis of the 
bootstrap and in fact hold in many practically interesting cases. For example, M-estimators 



are generally Hadamard differentiable (under some regularity conditions) (van der Vaart 



1998 van der Vaart and Wellner , 1996 ), and the assumptions on £ are satisfied if, for example, 



£ computes a cdf value. Theorem [T] can also be generalized to hold for sequences s — > oo and 
more general forms of £, but such generalization appears to require stronger assumptions, 
such as uniform integrability of the £(Qn0P„&)); the need for stronger assumptions in order 
to obtain more general consistency results has also been noted in prior work on the bootstrap 
(e.g., 



sec 



Hahn (1995)). 



Moving beyond analysis of the asymptotic consistency of BLB, we now characterize its 
higher-order correctness (i.e., the rate of convergence of its output to £(Q n (P))). A great 
deal of prior work has been devoted to showing that the bootstrap is higher-order correct 
in many cases (e.g., see the seminal book by Hall (1992)), meaning that it converges to the 
true value £(Q n (P)) at a rate of Op(l/n) or faster. In contrast, methods based on analytical 
asymptotic approximation are generally correct only at order Op(l/y/n). The bootstrap 
converges more quickly due to its more data-driven nature, which allows it to better capture 
finite-sample deviations of the distribution of 9 n from its asymptotic limiting distribution. 

As shown by the following theorem, BLB shares the same degree of higher-order correct- 
ness as the bootstrap, assuming that s and b are chosen to be sufficiently large. Importantly, 
sufficiently large values of b here can still be significantly smaller than n, with b/n — > as 
n — > oo. Following prior analyses of the bootstrap, we now make the standard assumption 
that £ can be represented via an asymptotic series expansion in powers of 1/y/n. In fact, 
prior work provides such expansions in a variety of settings. When £ computes a cdf value, 
these expansions are termed Edgeworth expansions; if £ computes a quantile, then the rel- 



evant expansions are Cornish-Fisher expansions. See Hall (1992) for a full development of 



such expansions both in generality as well as for specific forms of the estimator, including 
smooth functions of mean-like statistics and curve estimators. 



Theorem 2. Suppose that £(Q n (P)) admits an expansion as an asymptotic series 

Pi Pk I 1 



Z(Q n (P)) = z + 



n *~ n k / 2 



+ o 



n 



k/2 



(2) 



where z is a constant independent of P and the pk are polynomials in the moments of P. 
Additionally, assume that the empirical version of £,{Q n {P)) for any j admits a similar 
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expansion 

Aj) A3) / i \ 

+ % + + % + [n^) > ( 3 ) 

where z is as defined above and the p^ are polynomials in the moments o/P^j, obtained by 
replacing the moments of P in the pt with those of F^\. Then, assuming that b < n and 
E(p^) 2 < oo for k e {1,2}, 



i=l 

(4) 

Therefore, taking s = f)(n Var(pj^ — pjt|P n )) and 6 = fi(\/n) yields 

3=1 

m which case BLB enjoys the same level of higher-order correctness as the bootstrap. 

Note that it is natural to assume above that £(<3n(P„l)) can be expanded in powers of 
l/\/n, rather than 1/y/b, because Q n (P^) is the distribution of the estimate computed on 
n points sampled from P^|. The fact that only b points are represented in P^| enters via 

the p^\ which are polynomials in the sample moments of those b points. 

Theorem [2] indicates that, like the bootstrap, BLB can converge at rate Op(l/n) (as- 
suming that s and b grow at a sufficient rate). Additionally, because Var(^ 1} -p k \F n ) is 
decreasing in probability as b and n increase, s can grow significantly more slowly than n 
(indeed, unconditionally, p^ —p k = Op(l/Vb)). While Var(p^ — Pfc|P n ) can in principle be 
computed given an observed dataset, as it depends only on P„ and the form of the estimator 
under consideration, we can also obtain a general upper bound (in probability) on the rate 
of decrease of this conditional variance: 

Remark 1. Assuming that E(p^) 4 < 00, Var(pj^ — Pfe|P„) = P (l/y/n) + 0(1/6). 

The following result, which applies to the alternative variant of BLB that constrains 
the s randomly sampled subsets to be disjoint, also highlights the fact that s can grow 
substantially more slowly than n: 

Theorem 3. Under the assumptions of Theorem \^ and assuming that BLB uses disjoint 
random subsets of the observed data (rather than simple random subsamples), we have 

s 

3=1 



O, 




+ 0i 



o 
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Therefore, if s ~ (n/b) and b = Vt(y/n), then 



iy Et(Q»$ib)-Z{Qn(P)) 



3=1 



o, 



n 



in which case BLB enjoys the same level of higher-order correctness as the bootstrap. 

Finally, while the assumptions of the two preceding theorems generally require that £ 
studentizes the estimator under consideration (which involves dividing by an estimate of 
standard error), similar results hold even if the estimator is not studentized. In particular, 
not studentizing slows the convergence rate of both the bootstrap and BLB by the same 
factor, generally causing the loss of a factor of Op(l/y/n) (van der Vaart, 1998). 



3.2 Simulation Study 

We investigate empirically the statistical performance characteristics of BLB and compare 
to the statistical performance of existing methods via experiments on simulated data. Use 
of simulated data is necessary here because it allows knowledge of P, Q n (P), and hence 
£{Qn{P))\ this ground truth is required for evaluation of statistical correctness. For different 
datasets and estimation tasks, we study the convergence properties of BLB as well as the 
bootstrap, the m out of n bootstrap, and subsampling. 

We consider two different settings: regression and classification. For both settings, the 
data have the form Xj = (Xj,Yj) ~ P, i.i.d. for % = 1,. . . ,n, where Xj e M. d ; ^ S 1 for 
regression, whereas Y,, G {0, 1} for classification. In each case, 9 n estimates a parameter 
vector in M. d for a linear or generalized linear model of the mapping between Xj and Y{. 
We define £ as a procedure that computes a set of marginal 95% confidence intervals, one 
for each element of the estimated parameter vector. In particular, given an estimator's 
sampling distribution Q (or an approximation thereof), £ computes the boundaries of the 
relevant confidence intervals as the 2.5th and 97.5th percentiles of the marginal component- 
wise distributions defined by Q (averaging across £'s simply consists of averaging these 
percentile estimates). 

To evaluate the various quality assessment procedures on a given estimation task and true 
underlying data distribution P, we first compute the ground truth £(Q n (P)) by generating 
2, 000 realizations of datasets of size n from P, computing 9 n on each, and using this collec- 
tion of n 's to form a high-fidelity approximation to Q n (P). Then, for an independent dataset 
realization of size n from the true underlying distribution, we run each quality assessment 
procedure (without parallelization) until it converges and record the estimate of £(Q n (P)) 
produced after each iteration (e.g., after each bootstrap resample or BLB subsample is pro- 
cessed), as well as the cumulative processing time required to produce that estimate. Every 
such estimate is evaluated based on the average (across dimensions) relative deviation of its 
component-wise confidence intervals' widths from the corresponding true widths; given an 
estimated confidence interval width c and a true width c a , the relative deviation of c from 
c is defined as |c — c \/c . We repeat this process on five independent dataset realizations 
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of size n and average the resulting relative errors and corresponding processing times across 
these five datasets to obtain a trajectory of relative error versus time for each quality assess- 
ment procedure. The relative errors' variances are small relative to the relevant differences 
between their means, and so these variances are not shown in our plots. Note that we eval- 
uate based on confidence interval widths, rather than coverage probabilities, to control the 
running times of our experiments: in our experimental setting, even a single run of a quality 
assessment procedure requires non-trivial time, and computing coverage probabilities would 
require a large number of such runs. All experiments in this section were implemented and 
executed using MATLAB on a single processor. To maintain consistency of notation, we 
refer to the m out of n bootstrap as the b out of n bootstrap throughout the remainder of 
this section. For BLB, the b out of n bootstrap, and subsampling, we consider b = n 7 with 
7 G {0.5, 0.6, 0.7, 0.8, 0.9}; we use r = 100 in all runs of BLB. 

In the regression setting, we generate each dataset from a true underlying distribution P 
consisting of either a linear model Y{ = Xfld + or a model Yi = Xfld + Xf Xi + e« having 
a quadratic term, with d = 100 and n = 20, 000. The and 6{ are drawn independently 
from one of the following pairs of distributions: Xi ~ Normal(0, Id) with e« ~ Normal(0, 10); 
Xij ~ StudentT(3) i.i.d. for j = l,...,d with 6j ~ Normal(0, 10); or Ay ~ Gamma(l + 
5(j — 1)/ max(c? — 1, 1), 2) — 2[1 + 5(j — 1)/ max(c? — 1, 1), 2] independently for j = 1, . . . , d 
with ti ~ Gamma(l,2) — 2. All of these distributions have EXi = Eei = 0, and the last 
Xi distribution has non-zero skewness which varies among the dimensions. In the regression 
setting under both the linear and quadratic data generating distributions, our estimator 9 n 
consists of a linear (in Aj) least squares regression with a small L 2 penalty on the parameter 
vector to encourage numerical stability (we set the weight on this penalty term to 10~ 5 ). 
The true average (across dimensions) marginal confidence interval width for the estimated 
parameter vector is approximately 0.1 under the linear data generating distributions (for all 
Xi distributions) and approximately 1 under the quadratic data generating distributions. 

Figure [T] shows results for the regression setting under the linear and quadratic data 
generating distributions with the Gamma and StudentT Xi distributions; similar results 
hold for the Normal Xi distribution. In all cases, BLB (top row) succeeds in converging to 
low relative error significantly more quickly than the bootstrap, for all values of b considered. 
In contrast, the b out of n bootstrap (middle row) fails to converge to low relative error for 
smaller values of b (below n ' 7 ). Additionally, subsampling (bottom row) performs strictly 
worse than the b out of n bootstrap, as subsampling fails to converge to low relative error 
for both smaller and larger values of b (e.g., for b = n 09 ). Note that fairly modest values 
of s suffice for convergence of BLB (recall that s values are implicit in the time axes of 
our plots), with s at convergence ranging from 1-2 for b = n a9 up to 10-14 for b = ra 05 , 
in the experiments shown in Figure [T] larger values of s are required for smaller values 
of b, which accords with both intuition and our theoretical analysis. Under the quadratic 
data generating distribution with StudentT Xi distribution (plots not shown), none of the 
procedures (including the bootstrap) converge to low relative error, which is unsurprising 
given the StudentT(3) distribution's lack of moments beyond order two. 

For the classification setting, we generate each dataset considered from either a linear 
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Figure 1: Relative error vs. processing time for regression setting. The top row shows BLB 
with bootstrap (BOOT), the middle row shows b out of n bootstrap (BOFN), and the bottom 
row shows subsampling (SS). For BLB, BOFN, and SS, b = n< with the value of 7 for each 
trajectory given in the legend. The left column shows results for linear regression with linear 
data generating distribution and Gamma Xi distribution. The middle column shows results 
for linear regression with quadratic data generating distribution and Gamma Xi distribution. 
The right column shows results for linear regression with linear data generating distribution 
and StudentT Xi distribution. 
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Figure 2: Relative error vs. processing time for classification setting with n = 20, 000. 
The top row shows BLB with bootstrap (BOOT); bottom row shows b out of n bootstrap 
(BOFN). For both BLB and BOFN, b = n 1 with the value of 7 for each trajectory given in 
the legend. The left column shows results for logistic regression with linear data generating 
distribution and Gamma Xj distribution. The middle column shows results for logistic 
regression with quadratic data generating distribution and Gamma Xj distribution. The 
right column shows results for logistic regression with linear data generating distribution 
and StudentT Xj distribution. 

model Yi ~ Bernoulli((l + exp(— Xfl)) -1 ) or a model lj ~ Bernoulli((l + exp(— Xjl — 
XfXi))" 1 ) having a quadratic term, with d = 10. We use the three different distributions 
on Xj defined in the regression setting. Our estimator, under both the linear and quadratic 
data generating distributions, consists of a linear (in Xj) logistic regression fit via Newton's 
method, again using an L2 penalty term with weight 10~ 5 to encourage numerical stability. 
For this estimation task with n = 20, 000, the true average (across dimensions) marginal 
confidence interval width for the estimated parameter vector is approximately 0.1 under the 
linear data generating distributions (for all Xj distributions) and approximately 0.02 under 
the quadratic data generating distributions. 

Figure [2] shows results for the classification setting under the linear and quadratic data 
generating distributions with the Gamma and StudentT Xj distributions, and n = 20, 000 
(as in Figure [l]); results for the Normal Xj distribution are qualitatively similar. Here, the 
performance of the various procedures is more varied than in the regression setting. The 
case of the linear data generating distribution with Gamma Xj distribution (left column of 
Figure [2]) appears to be the most challenging. In this setting, BLB converges to relative error 
comparable to that of the bootstrap for b > n 0M , while converging to higher relative errors 
for the smallest values of b considered. For the larger values of b, which are still significantly 
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Figure 3: Relative error (after convergence) vs. n for BLB, the b out of n bootstrap (BOFN), 
and the bootstrap (BOOT) in the classification setting. For both BLB and BOFN, b = n 7 
with the relevant values of 7 given in the legend. The left plot shows results for logistic re- 
gression with linear data generating distribution and Gamma Xi distribution. The right plot 
shows results for logistic regression with linear data generating distribution and StudentT 
Xi distribution. 



smaller than n, we again converge to low relative error faster than the bootstrap. We are 
also once again more robust than the b out of n bootstrap, which fails to converge to low 
relative error for b < n - 7 . In fact, even for b < n os , BLB's performance is superior to that 
of the b out of n bootstrap. Qualitatively similar results hold for the other data generating 
distributions, but with BLB and the b out of n bootstrap both performing better relative 
to the bootstrap. In the experiments shown in Figure [2j the values of s (which are implicit 
in the time axes of our plots) required for convergence of BLB range from 1-2 for b = n 9 
up to 10-20 for b < n 06 (for cases in which BLB converges to low relative error). As in the 
regression setting, subsampling (plots not shown) has performance strictly worse than that 
of the b out of n bootstrap in all cases. 

To further examine the cases in which BLB (when using small values of b) does not 
converge to relative error comparable to that of the bootstrap, we explore how the various 
procedures' relative errors vary with n. In particular, for different values of n (and b), we 
run each procedure as described above and report the relative error that it achieves after 
it converges (i.e., after it has processed sufficiently many subsets, in the case of BLB, or 
resamples, in the case of the b out of n bootstrap and the bootstrap, to allow its output to 
stabilize). Figure [3] shows results for the classification setting under the linear data generating 
distribution with the Gamma and StudentT Xi distributions; qualitatively similar results 
hold for the Normal Xi distribution. As expected based on our previous results for fixed n, 
BLB's relative error here is higher than that of the bootstrap for the smallest values of b 
and n considered. Nonetheless, BLB's relative error decreases to that of the bootstrap as 
n increases — for all considered values of 7, with b = n 1 — in accordance with our theoretical 
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analysis; indeed, as n increases, we can set b to progressively more slowly growing functions 
of n while still achieving low relative error. Furthermore, BLB's relative error is consistently 
substantially lower than that of the b out of n bootstrap and decreases more quickly to the 
low relative error of the bootstrap as n increases. 



4 Computational Scalability 

The experiments of the preceding section, though primarily intended to investigate statistical 
performance, also provide some insight into computational performance: as seen in Figures [T] 
and[2j when computing on a single processor, BLB generally requires less time, and hence less 
total computation, than the bootstrap to attain comparably high accuracy. Those results 
only hint at BLB's superior ability to scale computationally to large datasets, which we 
now demonstrate in full in the following discussion and via large-scale experiments on a 
distributed computing platform. 

As discussed in Section [2j modern massive datasets often exceed both the processing and 
storage capabilities of individual processors or compute nodes, thus necessitating the use 
of parallel and distributed computing architectures. As a result, the scalability of a quality 
assessment method is closely tied to its ability to effectively utilize such computing resources. 

Recall from our exposition in preceding sections that, due to the large size of bootstrap 
resamples, the following is the most natural avenue for applying the bootstrap to large-scale 
data using distributed computing: given data partitioned across a cluster of compute nodes, 
parallelize the estimate computation on each resample across the cluster, and compute on 
one resample at a time. This approach, while at least potentially feasible, remains quite 
problematic. Each computation of the estimate will require the use of an entire cluster 
of compute nodes, and the bootstrap repeatedly incurs the associated overhead, such as 
the cost of repeatedly communicating intermediate data among nodes. Additionally, many 



cluster computing systems currently in widespread use (e.g., Hadoop MapReduce (Hadoop 



2012)) store data only on disk, rather than in memory, due to physical size constraints (if 
the dataset size exceeds the amount of available memory) or architectural constraints (e.g., 
the need for fault tolerance). In that case, the bootstrap incurs the extreme costs associated 
with repeatedly reading a very large dataset from disk — reads from disk are orders of mag- 
nitude slower than reads from memory. Though disk read costs may be acceptable when 
(slowly) computing only a single full-data point estimate, they easily become prohibitive 
when computing many estimates on one hundred or more resamples. Furthermore, as we 
have seen, executing the bootstrap at scale requires implementing the estimator such that it 
can be run on data distributed over a cluster of compute nodes. 

In contrast, BLB permits computation on multiple (or even all) subsamples and resam- 
ples simultaneously in parallel, allowing for straightforward distributed and parallel imple- 
mentations which enable effective scalability and large computational gains. Because BLB 
subsamples and resamples can be significantly smaller than the original dataset, they can 
be transferred to, stored by, and processed independently on individual (or very small sets 
of) compute nodes. For instance, we can distribute subsamples to different compute nodes 
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and subsequently use intra-node parallelism to compute across different resamples generated 
from the same subsample. Note that generation and distribution of the subsamples requires 
only a single pass over the full dataset (i.e., only a single read of the full dataset from disk, if 
it is stored only on disk), after which all required data (i.e., the subsamples) can potentially 
be stored in memory. Beyond this significant architectural benefit, we also achieve imple- 
mentation and algorithmic benefits: we do not need to parallelize the estimator internally to 
take advantage of the available parallelism, as BLB uses this available parallelism to compute 
on multiple resamples simultaneously, and exposing the estimator to only b rather than n 
distinct points significantly reduces the computational cost of estimation, particularly if the 
estimator computation scales super-linearly. 

Given the shortcomings of the m out of n bootstrap and subsampling illustrated in the 
preceding section, we do not include these methods in the scalability experiments of this 
section. However, it is worth noting that these procedures have a significant computational 
shortcoming in the setting of large-scale data: the m out of n bootstrap and subsampling 
require repeated access to many different random subsets of the original dataset (in contrast 
to the relatively few, potentially disjoint, subsamples required by BLB), and this access can 
be quite costly when the data is distributed across a cluster of compute nodes. 

We now detail our large-scale experiments on a distributed computing platform. For this 
empirical study, we use the experimental setup of Section 32 with some modification to 
accommodate larger scale and distributed computation. First, we now use d = 3, 000 and n = 
6, 000, 000 so that the size of a full observed dataset is approximately 150 GB. The full dataset 
is partitioned across a number of compute nodes. We again use simulated data to allow 
knowledge of ground truth; due to the substantially larger data size and attendant higher 
running times, we now use 200 independent realizations of datasets of size n to numerically 
compute the ground truth. As our focus is now computational (rather than statistical) 
performance, we present results here for a single data generating distribution which yields 
representative statistical performance based on the results of the previous section; for a 
given dataset size, changing the underlying data generating distribution does not alter the 
computational resources required for storage and processing. For the experiments in this 
section, we consider the classification setting with StudentT Xi distribution. The mapping 
between X$ and Y^ remains similar to that of the linear data generating distribution in 
Section 3.2 but with the addition of a normalization factor to prevent degeneracy when 
using larger d: Yi ~ Bernoulli((l+exp(— Xfl/y/d))' 1 ). We implement the logistic regression 
using L-BFGS ( Nocedal and Wright[ 2006) due to the significantly larger value of d. 

We compare the performance of BLB and the bootstrap, both implemented as described 
above. That is, our implementation of BLB processes all subsamples simultaneously in par- 
allel on independent compute nodes; we use r = 50, s = 5, and b = n ' 7 . Our implementation 
of the bootstrap uses all available processors to compute on one resample at a time, with 
computation of the logistic regression parameter estimates parallelized across the available 
compute nodes by simply distributing the relevant gradient computations among the dif- 



ferent nodes upon which the data is partitioned. We utilize Poisson resampling (van der 



Vaart and Wellner, 1996) to generate bootstrap resamples, thereby avoiding the complexity 
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Figure 4: Relative error vs. processing time for BLB (with b = n ' 7 ) and the bootstrap 
(BOOT) on 150 GB of data in the classification setting. The left plot shows results with 
the full dataset stored only on disk; the right plot shows results with the full dataset cached 
in memory. Because BLB's computation is fully parallelized across all subsamples, we show 
only the processing time and relative error of BLB's final output. 



of generating a random multinomial vector of length n in a distributed fashion. Due to high 
running times, we show results for a single trial of each method, though we have observed 
little variability in qualitative outcomes during development of these experiments. All exper- 
iments in this section are run on Amazon EC2 and implemented in the Scala programming 
language using the Spark cluster computing framework (Zaharia et al. 2012), which pro- 



vides the ability to either read data from disk (in which case performance is similar to that 
of Hadoop MapReduce) or cache it in memory across a cluster of compute nodes (provided 
that sufficient memory is available) for faster repeated access. 

In the left plot of Figure |4| we show results obtained using a cluster of 10 worker nodes, 
each having 6 GB of memory and 8 compute cores; thus, the total memory of the cluster is 
60 GB, and the full dataset (150 GB) can only be stored on disk (the available disk space 
is ample and far exceeds the dataset size). As expected, the time required by the boot- 
strap to produce even a low-accuracy output is prohibitively high, while BLB provides a 
high-accuracy output quite quickly, in less than the time required to process even a single 
bootstrap resample. In the right plot of Figure |4j we show results obtained using a cluster of 
20 worker nodes, each having 12 GB of memory and 4 compute cores; thus, the total memory 
of the cluster is 240 GB, and we cache the full dataset in memory for faster repeated ac- 
cess. Unsurprisingly, the bootstrap's performance improves significantly with respect to the 
previous disk-bound experiment. However, the performance of BLB (which also improves), 
remains substantially better than that of the bootstrap. 
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5 Hyper parameter Selection 



Like existing resampling-based procedures such as the bootstrap, BLB requires the specifi- 
cation of hyperparameters controlling the number of subsamples and resamples processed. 
Setting such hyperparameters to be sufficiently large is necessary to ensure good statistical 
performance; however, setting them to be unnecessarily large results in wasted computation. 
Prior work on the bootstrap and related procedures — which largely does not address compu- 
tational issues — generally assumes that a procedure's user will simply select a priori a large, 
constant number of resamples to be processed (with the exception of Tibshirani (1985), who 
does not provide a general solution for this issue). However, this approach reduces the level 
of automation of these methods and can be quite inefficient in the large data setting, in 
which each subsample or resample can require a substantial amount of computation. 

Thus, we now examine the dependence of BLB's performance on the choice of r and 
s, with the goal of better understanding their influence and providing guidance toward 
achieving adaptive methods for their selection. For any particular application of BLB, we 
seek to select the minimal values of r and s which are sufficiently large to yield good statistical 
performance. 

Recall that in the simulation study of Section 3J2 across all of the settings considered, 
fairly modest values of r (100 for confidence intervals) and s (from 1-2 for b = n 9 up to 
10-20 for b = n 6 ) were sufficient. The left plot of Figure [B] provides further insight into the 
influence of r and s, giving the relative errors achieved by BLB with b = n 0J for different r, s 
pairs in the classification setting with linear data generating distribution and StudentT 
distribution. In particular, note that for all but the smallest values of r and s, it is possible 
to choose these values independently such that BLB achieves low relative error; in this case, 
selecting s > 3, r > 50 is sufficient. 

While these results are useful and provide some guidance for hyperparameter selection, 
we expect the sufficient values of r and s to change based on the identity of £ (e.g., we 
expect a confidence interval to be harder to compute and hence to require larger r than a 
standard error) and the properties of the underlying data. Thus, to help avoid the need to 
set r and s to be conservatively and inefficiently large, we now provide a means for adaptive 
hyperparameter selection, which we validate empirically. 

Concretely, to select r adaptively in the inner loop of Algorithm [TJ we propose an iterative 
scheme whereby, for any given subsample j, we continue to process resamples and update 
£*j until it has ceased to change significantly. Noting that the values Q* nk used to compute 
£*j are conditionally i.i.d. given a subsample, for most forms of £ the series of computed £* ■ 
values will be well behaved and will converge (in many cases at rate 0(l/y/r), though with 
unknown constant) to a constant target value as more resamples are processed. Therefore, it 
suffices to process resamples (i.e., to increase r) until we are satisfied that £* • has ceased to 
fluctuate significantly; we propose using Algorithm [2] to assess this convergence. The same 
scheme can be used to select s adaptively by processing more subsamples (i.e., increasing s) 
until BLB's output value s _1 Yuj=i £n,j nas stabilized; in this case, one can simultaneously 
also choose r adaptively and independently for each subsample. When parallelizing across 
subsamples and resamples, one can simply process batches of subsamples and resamples 
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Figure 5: Results for BLB hyperparameter selection in the classification setting with linear 
data generating distribution and StudentT X^ distribution. The left plot shows the relative 
error achieved by BLB for different values of r and s, with b = n ' 7 . The right plot shows 
relative error vs. processing time (without parallelization) for BLB using adaptive selection 
of r and s (the resulting stopping times of the BLB trajectories are marked by large squares) 
and the bootstrap (BOOT); for BLB, b = n 1 with the value of 7 for each trajectory given in 
the legend. The table gives statistics of the different values of r selected by BLB's adaptive 
hyperparameter selection (across multiple subsamples, with b = n a7 ) when £ is either our 
usual confidence interval width-based quality measure (CI), or a component-wise standard 
error (STDERR); the relative errors achieved by BLB and the bootstrap are comparable in 
both cases. 




(with batch size determined by the available parallelism) until the output stabilizes. 

The right plot of Figure [5] shows the results of applying such adaptive hyperparameter 
selection in a representative empirical setting from our earlier simulation study (without 
parallelization). For selection of r we use e = 0.05 and w = 20, and for selection of s we 
use e = 0.05 and w — 3. As illustrated in the plot, the adaptive hyperparameter selection 
allows BLB to cease computing shortly after it has converged (to low relative error), limiting 
the amount of unnecessary computation that is performed without degradation of statistical 
performance. Though selected a priori, e and w are more intuitively interpretable and less 
dependent on the details of £ and the underlying data generating distribution than r and 
s. Indeed, the aforementioned specific values of e and w yield results of comparably good 
quality when also used for the other data generation settings considered in Section 3.2 , when 
applied to a variety of real datasets in Section [6] below, and when used in conjunction with 
different forms of £ (see the table in Figure |5j which shows that smaller values of r are 
selected when £ is easier to compute). Thus, our scheme significantly helps to alleviate the 
burden of a priori hyperparameter selection. 

Automatic selection of a value of b in a computationally efficient manner would also be 
desirable but is more difficult due to the inability to easily reuse computations performed for 
different values of b. One could consider similarly increasing b from some small value until 



the output of BLB stabilizes (an approach reminiscent of the method proposed in Bickel and 



Sakov (2008) for the m out of n bootstrap); devising a means of doing so efficiently is the 



subject of future work. Nonetheless, based on our fairly extensive empirical investigation, it 
seems that b = n 0,7 is a reasonable and effective choice in many situations. 
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Algorithm 2: Convergence Assessment 



Input: A series z (2) , . . . , z^> G R d 

w G N: window size (< t) 

eGl: target relative error (> 0) 
Output: true if and only if the input series is deemed to have ceased to fluctuate 
beyond the target relative error 

if vj g [i, w], i Eti ''%wf <W| < e then 

return true 
else 

return false 
end 



6 Real Data 

In this section, we present the results of applying BLB to several different real datasets. In 
this case, given the absence of ground truth, it is not possible to objectively evaluate the 
statistical correctness of any particular estimator quality assessment method; rather, we are 
reduced to comparing the outputs of various methods (in this case, BLB, the bootstrap, and 
the b out of n bootstrap) to each other. Because we cannot determine the relative error 
of each procedure's output without knowledge of ground truth, we now instead report the 
average (across dimensions) absolute confidence interval width yielded by each procedure. 
Figure [6] shows results for BLB, the bootstrap, and the b out of n bootstrap on the 



UCI connect4 dataset (Frank and Asuncion, 2010), where the model is logistic regression 



(as in the classification setting of our simulation study above), d = 42, and n = 67,557. 
We select the BLB hyperparameters r and s using the adaptive method described in the 
preceding section. Notably, the outputs of BLB for all values of b considered, and the output 
of the bootstrap, are tightly clustered around the same value; additionally, as expected, BLB 
converges more quickly than the bootstrap. However, the values produced by the b out of n 
bootstrap vary significantly as b changes, thus further highlighting this procedure's lack of 
robustness. We have obtained qualitatively similar results on six additional datasets from 



the UCI dataset repository (ct-slice, magic, millionsong, parkinsons, poker, shuttle) (Frank 



and Asuncion, 2010) with different estimators (linear regression and logistic regression) and 



a range of different values of n and d (see the appendix for plots of these results). 



7 Time Series 



While we have focused thus far on the setting of i.i.d. data, variants of the bootstrap — 
such as the moving block bootstrap and the stationary bootstrap — have been proposed 



that of time series ( 
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Figure 6: Average (across dimensions) absolute confidence interval width vs. processing time 
on the UCI connect4 dataset (logistic regression, d = 42, n = 67,557). The left plot shows 
results for BLB (using adaptive hyperparameter selection, with the output at convergence 
marked by large squares) and the bootstrap (BOOT). The right plot shows results for the b 
out of n bootstrap (BOFN). For both BLB and BOFN, b = n 1 with the value of 7 for each 
trajectory given in the legend. 



1994). These bootstrap variants can be used within BLB, in computing the requisite plugin 
approximations £(Qn0Pnl))' ^° obtain variants of our procedure which are applicable in non- 
i.i.d. settings. The advantages (e.g., with respect to scalability) of such BLB variants over 
variants of the bootstrap (and its relatives) remain identical to the advantages discussed 
above in the context of large-scale i.i.d. data. We briefly demonstrate the extensibility of 
BLB by combining our procedure with the stationary bootstrap (Politis and Romano, 1994) 
to obtain a "stationary BLB" which is suitable for assessing the quality of estimators applied 
to large-scale stationary time series data. 

To extend BLB in this manner, we must simply alter both the subsample selection mech- 
anism and the resample generation mechanism such that both of these processes respect the 
underlying data generating process. In particular, for stationary time series data it suffices 
to select each subsample as a (uniformly) randomly positioned block of length b within the 
observed time series of length n. Given a subsample of size b, we generate each resample by 
applying the stationary bootstrap to the subsample to obtain a series of length n. That is, 
given p G [0, 1] (a hyperparameter of the stationary bootstrap), we first select uniformly at 
random a data point in the subsample series and then repeat the following process until we 
have amassed a new series of length n: with probability 1 — p we append to our resample 
the next point in the subsample series (wrapping around to the beginning if we reach the 
end of the subsample series), and with probability p we (uniformly at random) select and 
append a new point in the subsample series. Given subsamples and resamples generated in 
this manner, we execute the remainder of the BLB procedure as described in Algorithm [I] 

We now present simulation results comparing the performance of the bootstrap, BLB, the 
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Method Standard Stationary 



BLB-0.6 


2.2 ±.1 


4.2 ± 


.1 


BLB-0.7 


2.2 ± .04 


4.5 db 


.1 


BLB-0.8 


2.2 ±.1 


4.6 ± 


.2 


BLB-0.9 


2.2 ±.1 


4.6 ± 


.1 


BOOT 


2.2 db .1 


4.6 db 


.2 



Table 1: Comparison of standard and stationary bootstrap (BOOT) and BLB on stationary 
time series data with n — 5, 000. We report the average and standard deviation of estimates 
(after convergence) of the standard deviation of the rescaled mean aggregated over 10 trials. 
The true population value of the standard deviation of the rescaled mean is approximately 5. 



stationary bootstrap, and stationary BLB. In this experiment, initially introduced by Poli- 



tis and Romano (1994), we generate observed data consisting of a stationary time series 



Xi, . . . , X n e R where X t = Z t + Z t -i + Zt-i + Z t _ 3 + Z t _ 4 and the Z t are drawn inde- 
pendently from Normal(0, 1). We consider the task of estimating the standard deviation of 
the rescaled mean ~Y^ = iX t / ^Jn, which is approximately 5; we set p = 0.1 for the stationary 
bootstrap and stationary BLB. The results in Table [I] (for n = 5, 000) show the improvement 
of the stationary bootstrap over the bootstrap, the similar improvement of stationary BLB 
over BLB, and the fact that the statistical performance of stationary BLB is comparable to 
that of the stationary bootstrap for b > n 7 . Note that this exploration of stationary BLB is 
intended as a proof of concept, and additional investigation would help to further elucidate 
and perhaps improve the performance characteristics of this BLB extension. 



8 Conclusion 

We have presented a new procedure, BLB, which provides a powerful new alternative for 
automatic, accurate assessment of estimator quality that is well suited to large-scale data 
and modern parallel and distributed computing architectures. BLB shares the favorable 
statistical properties (i.e., consistency and higher-order correctness) and generic applicability 
of the bootstrap, while typically having a markedly better computational profile, as we have 
demonstrated via large-scale experiments on a distributed computing platform. Additionally, 
BLB is consistently more robust than the m out of n bootstrap and subsampling to the 
choice of subset size and does not require the use of analytical corrections. To enhance 
our procedure's computational efficiency and render it more automatically usable, we have 
introduced a means of adaptively selecting its hyperparameters. We have also applied BLB 
to several real datasets and presented an extension to non-i.i.d. time series data. 

A number of open questions and possible extensions remain. Though we have constructed 
an adaptive hyperparameter selection method based on the properties of the subsampling 
and resampling processes used in BLB, as well as empirically validated the method, it would 
be useful to develop a more precise theoretical characterization of its behavior. Additionally, 
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as discussed in Section [5j it would be beneficial to develop a computationally efficient means 
of adaptively selecting b. It may also be possible to further reduce r by using methods that 



have been proposed for reducing the number of resamples required by the bootstrap (Efron 



1988 Efron and Tibshirani, 1993) 



Furthermore, it is worth noting that, while BLB shares the statistical strengths of the 
bootstrap, we conversely do not expect our procedure to be applicable in cases in which the 



bootstrap fails (Bickel et al. , 1997). Indeed, it was such edge cases that originally motivated 



development of the m out of n bootstrap and subsampling, which are consistent in various 
settings that are problematic for the bootstrap. It would be interesting to investigate the 
performance of BLB in such settings and perhaps use ideas from the m out of n bootstrap 
and subsampling to improve the applicability of BLB in these edge cases while maintaining 
computational efficiency and robustness. Finally, note that averaging the plugin approxima- 
tions ^{Qni^nl)) in equation (1) implicitly corresponds to minimizing the squared error of 
BLB's output. It would be possible to specifically optimize for other losses on our estima- 



tor quality assessments by combining the £,(Q r 
rather than averages). 



n,b 



)) in other ways (e.g., by using medians 
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A Appendix: Proofs 



We provide here full proofs of the theoretical results included in Section 3.1 above. 
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A.l Consistency 



We first define some additional notation, following that used by van der Vaart and Well- 



ner 



(1996). Let / 00 (J-') be the set of all uniformly bounded real functions on J 7 , and let 
BLi(Z°°(7")) denote the set of all functions h : Z°°(.F) [0,1] such that \h(z x ) - h(z 2 )\ < 
\\ z i ~ ^2 1| j 7 ) V^i, Z2 G / 00 (J r ), where || • || j- is the uniform norm for maps from J 7 to M. We define 
Pf to be the expectation of f(X) when X ~ P; as suggested by this notation, throughout 
this section we will view distributions such as P, P n , and b as maps from some function 
class J- to R. E(-)* and P(-)* denote the outer and inner expectation of (■), respectively, 
and we indicate outer probability via P*. X = Y denotes that the random variables X and 
Y are equal in distribution, and is defined as the set {/ — g : /, g G J 7 , pp(f — g) < 5}, 
where pp(-) is the variance semimetric: pp(f) = {P{f — Pf) 2 ) 1 ^ 2 - 



Follow ing prior analyses of the bootstrap (Gine and Zinn 1990 van der Vaart and Well- 



ncr 



1996), we first observe that, conditioning on P^ for any j as b, n — > oo, resamples from 



the subsampled empirical distribution \ behave asymptotically as though they were drawn 
directly from P, the true underlying distribution: 



Lemma 1. Given for any j, let X*, . . . , X* ~ i.i.d., and define P* b = 
Additionally, we define the resampled empirical process G* b = ^Jn{¥* n b — P„|) 



-i 



X* ■ 



Then, for T 



a Donsker class of measurable functions such that is measurable for every 5 > 0, 

sup E pU )h(G* nb )-Eh(Gp) 

h&BL 1 (l°°{T)) n ' b 



5o, 



as n — > oo, for any sequence b — > oo, where £L(j) denotes expectation conditional on the 

n,b 

contents of the subscript andGp is a P-Brownian bridge process. "Furthermore, the sequence 
E (j)h{G* nb )* — E _y)/i(G* 6 )* converges to zero in probability for every h G 5Li(/ 00 (J r )). If 

n,b n,b 

-P*||/ — Pf\\jr < oo, then the convergence is also outer almost surely." (van der Vaart and 



Wellner, 1996) 



Proof. Note that = F b . Hence, applying Theorem 3.6.3 of 
(1996) with the identification (n, k n ) <r> (b,n) yields the desired 



van der Vaart and Wellner 
result. □ 



Lemma [l] states that, conditionally on the sequence P^, 



the sequence of processes G* b 

converges in distribution to the P-Brownian bridge process Gp, in probability. Noting that 
the empirical process G„ = y/n(F n — P) also converges in distribution to Gp (recall that J 7 is 
a Donsker class by assumption), it follows that size n resamples generated from P^| behave 
asymptotically as though they were drawn directly from P. Under standard assumptions, it 

then follows that £(Q„(P$)) - £{Qn{P)) 4 0: 

Lemma 2. Under the assumptions of Theorem^ for any j , 

W»0P$)) - Z(Qn(P)) 4 o 



as n — > oo, for any sequence b — > oo. 
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Proof. Let R be the random element to which \/n(0(P n ) — <P{P)) converges in distribution; 
note that the functional delta method (van der Vaart |1998 ) provides the form of R in terms 
of 4> and P. The delta method for the bootstrap (see Theorem 23.9 of van der Vaart (1998)) 
in conjunction with Lemma LU implies that, under our assumptions, • v /n(0(P* i6 ) — 0(P^|)) also 
converges conditionally in distribution to R, given P„[, in probability. Thus, the distribution 
of y/n((j)(F n ) - 4>(P)) and the distribution of y/n((j)(F* nb ) - </>(P^{)), the latter conditionally 



on Pjfj, have the same asymptotic limit in probability. As a result, given the assumed 

continuity of £, it follows that £(Q n (Pn&)) anc ^ £,(Qn(P)) have the same asymptotic limit, in 
probability. □ 



, (V u jf],)) is asymptotically consistent 



The above lemma indicates that each individual £ (0, 
as b, n — > oo. Theorem [T] immediately follows: 

Proof of Theorem^ Lemma[2]in conjunction with the continuous mapping theorem (van der 
Vaart, 1998) implies the desired result. □ 



A. 2 Higher-Order Correctness 

We first prove two supporting lemmas. 

Lemma 3. Assume that X\, . . . ,X b ~ P are i.i.d., and let p k (Xi, . . . , X b ) be the sam- 
ple version of pk based on Xi,...,X b , as defined in Theorem^ Then, assuming that 
E[p k (X 1 ,...,X b ) 2 ]<oo, 

Vai(p k {X x , ...,X b )- Pk ) = Var(p fc (Xi, . . .,X b )) = 0(1/6). 

Proof. By definition, the pk are simply polynomials in sample moments. Thus, we can write 

B A f) / b 



p k = p k (Xi, . . . , X b ) = cp II ( 6-1 E 9 ( a\ X i 

/3=1 a=l 



(6) 



i=i 



where each g& raises its argument to some power. Now, observe that for any (3, 

a=l \ i=l 

is a V-statistic of order Ap applied to the b observations X\, . . . , X b . Let hp(xi, . . . , XAa) 
denote the kernel of this V-statistic, which is a symmetrized version of Yla=i 

g ( £\x a ). It 

follows that pk = J2p=i c pVp i s itself a V-statistic of order A = m&xp Ap with kernel 
h(xi, . . . , xa) = J2p=i c phj3(xi, . . . , XA.p), applied to the b observations Xi, . . . , X b . Let U 
denote the corresponding U-statistic having kernel h. Then, using Proposition 3.5(H) and 



Corollary 3.2(i) of Shao (2003), we have 



A 



Var(p fe - p k ) = Vai(pk) = Var([/) + 0(b~ 2 ) < - Var(/i(Xi, . . . , X A )) + 0(b 



-2\ 



0(1/6). 
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□ 



Lemma 4. Assume that X\, . . . ,Xb ~ P are i.i.d., and let p k (Xi, . . . , Xb) be the sam- 
ple version of p k based on Xi,...,X b , as defined in Theorem^ Then, assuming that 
E\p k (X 1 , ...,X b )\ <oo, 

\E[p k (X 1 ,...,X b )]-p k \ = 0(l/b). 

Proof. As noted in the proof of Lemma [3j we can write 

b A p / b 

p k {Xx, ■ - - , X h ) = C P J! E ^\Xr 



=1 a=l 



i=l 



where each raises its argument to some power. Similarly, 

B Ap 

^E^IW^ra, 

0=1 a=l 



and so 

\E\p k (Xi, . . .,X b )} -p k \ = 

< 



B Ap / 


b 


0=1 a= 


i \ 




B 






Em ■ 


E 


n 

a=l \ 



f3=l a=l 



a=l 



Given that the number of terms in the outer sum on the right-hand side is constant with 
respect to b, to prove the desired result it is sufficient to show that, for any (3, 



E 



Ap 



n 6_i e^ 

a=l V i=l 



a=l 



O 



Observe that 



E 



n 6_i e^ 

a=l \ i=l 



b~ A PE 



E IK^ 



If i u . . . , i Ap are all distinct, then E Y[ a L x 9a\ X i a ) = Il2=i Eg£ } {Xx) because X ly . . . ,X b 
are i.i.d.. Additionally, the right-hand summation in ([7]) has b\/(b — Ap)\ terms in which 



flAp=l a =i 



(7) 
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ix,...,iA are all distinct; correspondingly, there are b A & —b\/(b — Ap)\ terms in which 



3a, a' s.t. in = i„t. Therefore, it follows that 



E 



a=l \ i=l 



b\ 



(b-A,)\l\ 



n^(^) + e E u^( x i. 



l<ii,---,iA^^6 a=l 
3a, a' s.t. i a =i„i 



and 







"As / 






n 






a=l \ 



n ^e^w 



a=l 



ft"*" 



< 6- A " 



b\ 



{b-Ap)\ 



b\ 



a=l l<'iv>U»<! ) a=l 



3a,a' s.t. i a =i„/ 



(6-A»)J 



b A, 



a=l 



6! 



C, 



where 



C 



l<zx,...,U^<6 



max 

3a,a' s.t. i a =i a i 

Note that C is a constant with respect to b. Also, simple algebraic manipulation shows that 



b A P - Kb ^ + 0(6^-^) 



(6-^)! 

for some k > 0. Thus, plugging into equation (pj), we obtain the desired result: 

Ac 



A/s < b- A ? Inb^- 1 + 0(b 



\A«-2\ 



«=i 



l^- 1 + o(b A ?- 2 )\ c = o 



□ 



We now provide full proofs of Theorem |2j Remark [TJ and Theorem [3j 
Proof of Theorem^ Summing the expansion (|3j) over j, we have 

s- 1 E w»opS)) = * + n- 1 ' 2 *- 1 E# + v 1 E# + op (-) . 
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Subtracting the corresponding expansion ^ for £(Q n (P)), we then obtain 



< n 



-1/2 



Pi 



j'=i 



'E 



77 



(9) 

We now further analyze the first two terms on the right-hand side of the above expression; 
for the remainder of this proof, we assume that k e {1,2}. Observe that, for fixed k, the 
p^P are conditionally i.i.d. given X\, . . . ,X n for all j, and so 



Var ( s- 1 (\Pk } - Pk] - E[p^ - p h \ 

V j=i 



Var^-pfclP, 



P,. 



where we denote by E\p^ — Pfc|P n ] and Vax(p^ — pfc|P 
of pjp — pk over realizations of P^ conditionally on Xi, 



a permutation-symmetric function of size b subsets of Xi, . . . ,X n , E[p\, — Pfc|P n ] is a Ti- 



the expectation and variance 

. , X n . Now, given that p^ is 

(i) 



statistic of order b. Hence, we can apply Corollary 3.2 (i) of Shao (2003) in conjunction with 
Lemma [3] to find that 



b 

n 



< — Vai(p^ 



Var - Pk \F n ] - E\pf - p fc ]) = Var (e^ - p fc |P. 

From the result of Lemma |4| we have 

■Pk]\ = 

Combining the expressions in the previous three panels, we find that 



Pk) 



O 



\E[p k l) - Vk }\ = 0(\ 



Pk 



3=1 



Oj 




+ 



Finally, plugging into equation ^ with k — 1 and k = 2, we obtain the desired result. 
Proof of Remark [IJ Observe that 



□ 



Vax^-pfclPn) < E\ 



t ] -Pk) 2 \ 



E\ 



i l) -pk) 2 \ 



E\ 



i l) -Pk) 2 }+E[(p^- Pk ) 



Given that p£' is a polynomial in the moments of P^, = (p^ —pk) 2 is also a polynomial 
in the moments of P^|. Hence, Lemma [i] applies to Additionally, is a permutation- 
symmetric function of size b subsets of Xi, . . . , X n , and so £?[g^|P n ] is a U-statistic of order 
b. Therefore, applying Corollary 3.2(i) of Shao (2003) in conjunction with Lemma[3j we find 
that 



Var E 



i ] -Pk?\ 



E\ 



i 1] -Pkr 



Var [E[$ 



Wi 



< - Var(g£ 



O 



n 



30 



Now, 



By Lemmas 3 and 



E[{Pk ] - Pk?] = Var(p^ - p fc ) + £[p^ - Pk 



Var(p^ — pk) = 0(1/6) and — Pk} 2 = 0(l/b 2 ). Combining with 



the expressions in the previous three panels, we obtain the desired result: 

v ar(!i i" -MK) = o P + o (i) + o (I) = o P + o (i) . 



Proof of Theorem [3| As noted in the proof of Theorem |2j 



□ 



< n 



-1/2 



E 

j'=i 



Px -Pi 



+n 



-i 



E 

3=1 



P 2 ~P2 



-Op 



n 



(10) 

Throughout this proof, we assume that k G {1,2}. Under the assumptions of this theorem, 
the are based on disjoint subsets of the n observations and so are i.i.d.. Hence, for any 
k, the p^ are i.i.d. for all j, and so using Lemma 



Var 



C?) 



Pk 



3=1 



Additionally, from the result of Lemma [3J we have 

\E\$ ) -p k ]\=o(±y 

Combining the expressions in the previous two panels, we find that 



Pk 



3=1 



Finally, plugging into equation (10) with k = 1 and k = 2, we obtain the desired result. □ 



B Appendix: Additional Real Data Results 
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Figure 7: Average (across dimensions) absolute confidence interval width vs. processing time 
on the UCI ct-slice dataset (linear regression, d = 385, n = 53,500). The left plot shows 
results for BLB (using adaptive hyperparameter selection, with the output at convergence 
marked by large squares) and the bootstrap (BOOT). The right plot shows results for the b 
out of n bootstrap (BOFN). 
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Figure 8: Average (across dimensions) absolute confidence interval width vs. processing time 
on the UCI magic dataset (logistic regression, d = 10, n — 19,020). The left plot shows 
results for BLB (using adaptive hyperparameter selection, with the output at convergence 
marked by large squares) and the bootstrap (BOOT). The right plot shows results for the b 
out of n bootstrap (BOFN). 
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Figure 9: Average (across dimensions) absolute confidence interval width vs. processing time 
on the UCI millionsong dataset (linear regression, d = 90, n = 50, 000). The left plot shows 
results for BLB (using adaptive hyperparameter selection, with the output at convergence 
marked by large squares) and the bootstrap (BOOT). The right plot shows results for the b 
out of n bootstrap (BOFN). 
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Figure 10: Average (across dimensions) absolute confidence interval width vs. processing time 
on the UCI parkinsons dataset (linear regression, d = 16, n — 5,875). The left plot shows 
results for BLB (using adaptive hyperparameter selection, with the output at convergence 
marked by large squares) and the bootstrap (BOOT). The right plot shows results for the b 
out of n bootstrap (BOFN). 
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Figure 11: Average (across dimensions) absolute confidence interval width vs. processing 
time on the UCI poker dataset (logistic regression, d — 10, n = 50, 000). The left plot shows 
results for BLB (using adaptive hyperparameter selection, with the output at convergence 
marked by large squares) and the bootstrap (BOOT). The right plot shows results for the b 
out of n bootstrap (BOFN). 
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Figure 12: Average (across dimensions) absolute confidence interval width vs. processing 
time on the UCI shuttle dataset (logistic regression, d = 9, n = 43, 500). The left plot shows 
results for BLB (using adaptive hyperparameter selection, with the output at convergence 
marked by large squares) and the bootstrap (BOOT). The right plot shows results for the b 
out of n bootstrap (BOFN). 
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